Skip to content

Conversation

@GeigerJ2
Copy link
Contributor

@GeigerJ2 GeigerJ2 commented Jul 16, 2025

Adds a new --tmp-dir command line option to verdi archive create that allows users to specify a custom directory for temporary files during archive creation.

Changes:

  • New --tmp-dir CLI option with validation (must exist and be writable)
  • Enhanced error handling for disk space issue
  • Falls back to output file directory when not specified -> ! This is different from the previous behavior, where the file was always in /tmp !

Use cases:

  • File system has limited disk space
  • Use faster storage (e.g., SSD) for temporary operations
  • Need to control where temporary files are created for system administration reasons

Example:

verdi archive create --tmp-dir /fast/storage/tmp export.aiida

@codecov
Copy link

codecov bot commented Jul 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.03%. Comparing base (594cf56) to head (a515785).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6946      +/-   ##
==========================================
+ Coverage   79.03%   79.03%   +0.01%     
==========================================
  Files         566      566              
  Lines       43699    43721      +22     
==========================================
+ Hits        34531    34552      +21     
- Misses       9168     9169       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@GeigerJ2 GeigerJ2 force-pushed the feature/archive-base-tmp-dir branch 2 times, most recently from c116bc1 to 0fd924d Compare July 21, 2025 06:03
@GeigerJ2 GeigerJ2 force-pushed the feature/archive-base-tmp-dir branch from 777891b to 19c9dd3 Compare July 24, 2025 06:01
'output_id': d.target_id,
'label': d.link_label,
'type': d.link_type,
try:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Diff is large here because of the additional try-except and the indent that goes with it (to capture disk-space errors) but the actual code inside should be the same!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git diff --ignore-all-space main src/aiida/tools/archive/create.py gives instead only:

diff --git a/src/aiida/tools/archive/create.py b/src/aiida/tools/archive/create.py
index 94ca88cd4..c4f0671d5 100644
--- a/src/aiida/tools/archive/create.py
+++ b/src/aiida/tools/archive/create.py
@@ -12,6 +12,7 @@ The archive is a subset of the provenance graph,
 stored in a single file.
 """
 
+import os
 import shutil
 import tempfile
 from datetime import datetime
@@ -59,6 +60,7 @@ def create_archive(
     compression: int = 6,
     test_run: bool = False,
     backend: Optional[StorageBackend] = None,
+    tmp_dir: Optional[Union[str, Path]] = None,
     **traversal_rules: bool,
 ) -> Path:
     """Export AiiDA data to an archive file.
@@ -139,6 +141,12 @@ def create_archive(
 
     :param backend: the backend to export from. If not specified, the default backend is used.
 
+    :param tmp_dir: Directory to use for temporary files during archive creation.
+        If not specified, a temporary directory will be created in the same directory as the output file
+        with a '.aiida-export-' prefix. This parameter is useful when the output directory has limited
+        space or when you want to use a specific filesystem (e.g., faster storage) for temporary operations.
+        The directory must exist and be writable.
+
     :param traversal_rules: graph traversal rules. See :const:`aiida.common.links.GraphTraversalRules`
         what rule names are toggleable and what the defaults are.
 
@@ -280,10 +288,32 @@ def create_archive(
 
     EXPORT_LOGGER.report(f'Creating archive with:\n{tabulate(count_summary)}')
 
+    # Handle temporary directory configuration
+    tmp_prefix = '.aiida-export-'
+    if tmp_dir is not None:
+        tmp_dir = Path(tmp_dir)
+        if not tmp_dir.exists():
+            EXPORT_LOGGER.warning(f"Specified temporary directory '{tmp_dir}' doesn't exist. Creating it.")
+            tmp_dir.mkdir(parents=True)
+        if not tmp_dir.is_dir():
+            msg = f"Specified temporary directory '{tmp_dir}' is not a directory"
+            raise ArchiveExportError(msg)
+        # Check if directory is writable
+        # Taken from: https://stackoverflow.com/a/2113511
+        if not os.access(tmp_dir, os.W_OK | os.X_OK):
+            msg = f"Specified temporary directory '{tmp_dir}' is not writable"
+            raise ArchiveExportError(msg)
+
+    else:
+        # Create temporary directory in the same folder as the output file
+        tmp_dir = filename.parent
+
     # Create and open the archive for writing.
     # We create in a temp dir then move to final place at end,
     # so that the user cannot end up with a half written archive on errors
-    with tempfile.TemporaryDirectory() as tmpdir:
+    try:
+        tmp_dir.mkdir(parents=True, exist_ok=True)
+        with tempfile.TemporaryDirectory(dir=tmp_dir, prefix=tmp_prefix) as tmpdir:
             tmp_filename = Path(tmpdir) / 'export.zip'
             with archive_format.open(tmp_filename, mode='x', compression=compression) as writer:
                 # add metadata
@@ -302,7 +332,9 @@ def create_archive(
                     }
                 )
                 # stream entity data to the archive
-            with get_progress_reporter()(desc='Archiving database: ', total=sum(entity_counts.values())) as progress:
+                with get_progress_reporter()(
+                    desc='Archiving database: ', total=sum(entity_counts.values())
+                ) as progress:
                     for etype, ids in entity_ids.items():
                         if etype == EntityTypes.NODE and strip_checkpoints:
 
@@ -359,7 +391,9 @@ def create_archive(
 
                 # stream node repository files to the archive
                 if entity_ids[EntityTypes.NODE]:
-                _stream_repo_files(archive_format.key_format, writer, entity_ids[EntityTypes.NODE], backend, batch_size)
+                    _stream_repo_files(
+                        archive_format.key_format, writer, entity_ids[EntityTypes.NODE], backend, batch_size
+                    )
 
                 EXPORT_LOGGER.report('Finalizing archive creation...')
 
@@ -368,6 +402,16 @@ def create_archive(
 
             filename.parent.mkdir(parents=True, exist_ok=True)
             shutil.move(tmp_filename, filename)
+    except OSError as e:
+        if e.errno == 28:  # No space left on device
+            msg = (
+                f"Insufficient disk space in temporary directory '{tmp_dir}'. "
+                f'Consider using --tmp-dir to specify a location with more available space.'
+            )
+            raise ArchiveExportError(msg) from e
+
+        msg = f'Failed to create temporary directory: {e}'
+        raise ArchiveExportError(msg) from e
 
     EXPORT_LOGGER.report('Archive created successfully')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification @GeigerJ2! This'll help in my review <3

@GeigerJ2 GeigerJ2 marked this pull request as ready for review July 24, 2025 14:59
@GeigerJ2 GeigerJ2 requested a review from mbercx July 24, 2025 14:59
@GeigerJ2 GeigerJ2 changed the title Customize tmp dir during archive creation Add --tmp-dir option for archive creation Jul 24, 2025
@GeigerJ2 GeigerJ2 force-pushed the feature/archive-base-tmp-dir branch from d706758 to 3bb71b5 Compare July 25, 2025 05:54
@GeigerJ2 GeigerJ2 force-pushed the feature/archive-base-tmp-dir branch 3 times, most recently from b7edfa7 to 4a1c79d Compare August 13, 2025 13:00
@mbercx
Copy link
Member

mbercx commented Aug 21, 2025

Dogfooding:

Note: I've discovered a different issue while dogfooding, see #6982

  1. Non-existing temporary directory ✅
> verdi archive create --tmp-dir /nonexistent -G workchain/PBEsol/bader/lumi bader.aiida
Usage: verdi archive create [OPTIONS] [--] OUTPUT_FILE
Try 'verdi archive create --help' for help.

Error: Invalid value for '--tmp-dir': Directory '/nonexistent' does not exist.
  1. No temporary directory specified ✅
> verdi archive create -G workchain/PBEsol/scf/gauss -- test.aiida
  1. Temporary directory specified ✅
> verdi archive create -G workchain/PBEsol/scf/gauss --tmp-dir ~/tmp -- test.aiida
  1. Temporary directory not specified, but different parent folder for archive ✅
> verdi archive create -G workchain/PBEsol/scf/gauss  -- ~/tmp/test.aiida

I checked, and the temporary directories are indeed created in the desired locations. 👌

Copy link
Member

@mbercx mbercx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @GeigerJ2! Very happy to have this feature, and that the default path is now not in /tmp. I think having the parent of the filename makes a lot more sense.

Before you look at my comments: one thing I'm wondering is if that is the default, would there be any reason to specify a different tmp_dir? If I look at the code in more detail, the only reason it is creating a temporary directory is because we don't want to leave behind a partially written archive. In the end, the tmp_filename is simply moved to the target filename with shutil.move, and then the temporary directory is deleted. The target directory must have enough space for this to work.

I think @giovannipizzi suggested to have a --tmp-dir option, so the user could specify it. But now that I see the code in more detail, I don't see the use case. I think just changing the default to the parent of the filename of the archive is fine.

@click.option(
'--tmp-dir',
type=click.Path(exists=True, file_okay=False, dir_okay=True, writable=True, path_type=Path),
help='Directory to use for temporary files during archive creation. '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small concern here is that it sounds like the directory with the .aiida-export- prefix is only created when the parameter is not specified, and that in case it is the provided directory replaces that one. I wonder if the following text is not more clear:

Location where the temporary directory will be written during archive creation.
The directory must exist and be writable, and defaults to the parent directory of the output file.
This parameter is useful when the output directory has limited space or when you want to use a specific filesystem (e.g., faster storage) for temporary operations.

:param backend: the backend to export from. If not specified, the default backend is used.
:param tmp_dir: Directory to use for temporary files during archive creation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

readonly_tmp.mkdir()

# Make directory read-only
readonly_tmp.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have use stat here? You should also be able to pass an octal (I figured out recently ^^):

readonly_tmp.chmod(0o444)

if tmp_dir is not None:
tmp_dir = Path(tmp_dir)
if not tmp_dir.exists():
EXPORT_LOGGER.warning(f"Specified temporary directory '{tmp_dir}' doesn't exist. Creating it.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behaviour is different from the CLI command, that does expect the temporary directory to exist. It also seems to conflict with the docstring of the function.

EXPORT_LOGGER.report(f'Creating archive with:\n{tabulate(count_summary)}')

# Handle temporary directory configuration
tmp_prefix = '.aiida-export-'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why define this variable, instead of just passing its value in the TemporaryDirectory constructor?

'label': d.link_label,
'type': d.link_type,
try:
tmp_dir.mkdir(parents=True, exist_ok=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this line added here? Everything related to checking the tmp_dir input should be above, and for the default there should be no need to create the parent directory, the command should fail in case that doesn't exist.

if e.errno == 28: # No space left on device
msg = (
f"Insufficient disk space in temporary directory '{tmp_dir}'. "
f'Consider using --tmp-dir to specify a location with more available space.'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this can also happen when the archive_create function is used directly:

Suggested change
f'Consider using --tmp-dir to specify a location with more available space.'
f'Consider using `tmp-dir` to specify a location with more available space.'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still checking the code above in more detail, but what happens if there is enough space in the tmp_dir, but not in the target directory for the archive? Will the shutil.move command also fail with OSError with number 28? In that case, won't we be misleading the user?


EXPORT_LOGGER.report('Finalizing archive creation...')

if filename.exists():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's not part you just put in the try-except block, but it's interesting. I'd move this logic to the lines after the comment on line 167 (#check/set archive file path), since that is dealing with the target filename of the archive existing and whether or not to overwrite it.

if filename.exists():
filename.unlink()

filename.parent.mkdir(parents=True, exist_ok=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, not your code, but something to think about: Apparently you can just write the archive to a folder that doesn't exist and the command will create it and all non-existent parents. I think that's ok. But the behaviour is then different from the tmp_dir one. I suppose there you don't want to just create some directory with a bunch of parents that you then have to clean up after?

@GeigerJ2 GeigerJ2 removed their assignment Aug 21, 2025
@danielhollas
Copy link
Collaborator

Falls back to output file directory when not specified -> ! This is different from the previous behavior, where the file was always in /tmp !
Thanks @GeigerJ2! Very happy to have this feature, and that the default path is now not in /tmp. I think having the parent of the filename makes a lot more sense.

Just for my curiosity, what were the problems of using /tmp?

@GeigerJ2
Copy link
Contributor Author

Just for my curiosity, what were the problems of using /tmp?

Running out of disk space on the machine running AiiDA due to creation of the temporary archive file.

@GeigerJ2 GeigerJ2 force-pushed the feature/archive-base-tmp-dir branch from 3963661 to 23ca013 Compare September 11, 2025 07:30
@GeigerJ2 GeigerJ2 force-pushed the feature/archive-base-tmp-dir branch from 23ca013 to a515785 Compare September 18, 2025 05:07
@mbercx
Copy link
Member

mbercx commented Sep 19, 2025

@GeigerJ2 let me know when you want me to review this one again. Unless I am re-requested a review, I typically ignore open PRs that are changing.

@GeigerJ2
Copy link
Contributor Author

@GeigerJ2 let me know when you want me to review this one again. Unless I am re-requested a review, I typically ignore open PRs that are changing.

Hi @mbercx, thank you for checking on this again. I'd like to finalize the decision if we even keep the --tmp-dir option, then add the necessary changes. Will start the discussion on slack again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expose option to set custom base temporary directory for archive export

3 participants